SAS and R Integration for Machine Learning

This notebook is an example of how you can use SAS Viya with R for analysis. In this example, we will import R packages, start a CAS session, load data from the local file system into CAS, explore the data, impute missing values, create several models in R, create several models in CAS, score a test set using our models, and assess model performance.
Further documentation on using SWAT with R can be found in the SWAT for R documentation.


Scripting Wrapper for Analytics Transfer

The SAS Scripting Wrapper for Analytics Transfer (SWAT) package for R is an R interface to SAS Cloud Analytic Services (CAS), the centerpiece of the SAS Viya framework. With this package, you can load data into memory and apply CAS actions to transform, summarize, model, and score the data. Result tables from the actions are a subclass of data frame, enabling you to apply your existing R programming skills to further post-process CAS result tables.


Set Up the R Notebook for Analysis

First, we will load the packages we want to use.


In [1]:
# Load necessary packages
library('swat')
library('ggplot2')
library('reshape2')
library('rpart')
library('randomForest')
library('xgboost')

# Suppress CAS messages and warnings for cleaner notebook output
options(cas.print.messages = FALSE)
options(warn = -1)
# Reset any active output diversion
sink()


SWAT 1.4.0
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin

Now we can create our connection to CAS. Please see this documentation on connecting and starting CAS Sessions for more information.


In [2]:
conn <- CAS('localhost', port=5570, caslib = 'casuser')


NOTE: Connecting to CAS and generating CAS action functions for loaded
      action sets...
NOTE: To generate the functions with signatures (for tab completion), set 
      options(cas.gen.function.sig=TRUE).

Next, we need to load our CAS Action Sets. Please see this documentation on running CAS Actions for more information.


In [3]:
actionsets <- c('sampling', 'fedsql', 'decisionTree', 'percentile', 'autotune', 'regression')
for(i in actionsets){
    loadActionSet(conn, i)
}

Finally, we can load our data from a CSV file.


In [4]:
castbl <- cas.read.csv(conn, './data/hmeq.csv')

Explore the Data

Let us begin exploring our data. Just as with a local R data frame, we can view the first few rows of our CAS table using the head function.


In [5]:
head(castbl)


A casDataFrame: 6 × 13

  BAD  LOAN  MORTDUE   VALUE  REASON   JOB      YOJ  DEROG  DELINQ      CLAGE  NINQ  CLNO   DEBTINC
    1  1100    25860   39025  HomeImp  Other   10.5      0       0   94.36667     1     9       NaN
    1  1300    70053   68400  HomeImp  Other    7.0      0       2  121.83333     0    14       NaN
    1  1500    13500   16700  HomeImp  Other    4.0      0       0  149.46667     1    10       NaN
    1  1500      NaN     NaN                    NaN    NaN     NaN        NaN   NaN   NaN       NaN
    0  1700    97800  112000  HomeImp  Office   3.0      0       0   93.33333     0    14       NaN
    1  1700    30548   40320  HomeImp  Other    9.0      0       0  101.46600     1     8  37.11361

Visual exploration helps us better understand patterns and distributions within our data. We can use ggplot to look at our data distributions.


In [6]:
# Bring data locally
df <- to.casDataFrame(castbl, obs = nrow(castbl))
# Use reshape2's melt to help with data formatting
d <- melt(df[sapply(df, is.numeric)], id.vars=NULL)
ggplot(d, aes(x = value)) +
    facet_wrap(~variable,scales = 'free_x') +
    geom_histogram(fill = 'blue', bins = 25)


From the plots above, we can see that most of our home equity loans are good (BAD = 0). We can also see that most of our variables have a slight right skew. Let’s keep exploring our data by checking for missing values.
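The right skew suggested by the histograms can also be confirmed numerically. As an illustration (not part of the original notebook), here is a moment-based sample skewness in base R; positive values indicate a right skew. The sample data below is synthetic.

```r
# Moment-based sample skewness: E[(x - mean)^3] / sd^3.
# Positive => longer upper tail (right skew).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
skewness(rexp(1000))            # clearly positive: right-skewed
skewness(c(-2, -1, 0, 1, 2))    # symmetric sample: 0
```

In the notebook, the same function could be applied over the numeric columns of df with sapply.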


In [7]:
# Get the number of missing values for all variables
tbl <- cas.simple.distinct(castbl)$Distinct[,c('Column', 'NMiss')]
tbl


A data.frame: 13 × 2

  Column   NMiss
  BAD          0
  LOAN         0
  MORTDUE    518
  VALUE      112
  REASON     252
  JOB        279
  YOJ        515
  DEROG      708
  DELINQ     580
  CLAGE      308
  NINQ       510
  CLNO       222
  DEBTINC   1267

In [8]:
# Easy way to get missing values for numeric variables
cas.nmiss(castbl)


  BAD  LOAN  MORTDUE  VALUE  YOJ  DEROG  DELINQ  CLAGE  NINQ  CLNO  DEBTINC
    0     0      518    112  515    708     580    308   510   222     1267

In [9]:
# Visualize the missing data
tbl$PctMiss <- tbl$NMiss/nrow(castbl)
ggplot(tbl, aes(Column, PctMiss)) +
    geom_col(fill = 'blue') +
    ggtitle('Pct Missing Values') +
    theme(plot.title = element_text(hjust = 0.5))


Using both R and SWAT, we have explored our data, and now we have an idea of what to look for when we clean our data.


Clean the Data

First, we should impute those missing values.


In [10]:
# Impute missing values
cas.dataPreprocess.impute(castbl,
    methodContinuous = 'MEDIAN',
    methodNominal = 'MODE',
    inputs = colnames(castbl)[-1],
    copyAllVars = TRUE,
    casOut = list(name = 'hmeq', 
                replace = TRUE)
)


$ImputeInfo
A casDataFrame: 12 × 7

  Variable  ImputeTech  ResultVar       N  NMiss  ImputedValueContinuous  ImputedValueNominal
  LOAN      Median      IMP_LOAN     5960      0             16300.00000
  MORTDUE   Median      IMP_MORTDUE  5442    518             65019.00000
  VALUE     Median      IMP_VALUE    5848    112             89235.50000
  REASON    Mode        IMP_REASON   5708    252                     NaN  DebtCon
  JOB       Mode        IMP_JOB      5681    279                     NaN  Other
  YOJ       Median      IMP_YOJ      5445    515                 7.00000
  DEROG     Median      IMP_DEROG    5252    708                 0.00000
  DELINQ    Median      IMP_DELINQ   5380    580                 0.00000
  CLAGE     Median      IMP_CLAGE    5652    308               173.46667
  NINQ      Median      IMP_NINQ     5450    510                 1.00000
  CLNO      Median      IMP_CLNO     5738    222                20.00000
  DEBTINC   Median      IMP_DEBTINC  4693   1267                34.81826

$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name  Rows  Columns
  CASUSER(sasdemo05)  hmeq  5960       25

Now, we should partition our data into training and testing sets.


In [11]:
# Partition the data
cas.sampling.srs(conn,
    table = 'hmeq',
    samppct = 30,
    partind = TRUE,
    output = list(casOut = list(name = 'hmeq', replace = T), copyVars = 'ALL')
)


$outputSize
  outputNObs:   5960
  outputNVars:    26

$SRSFreq
A casDataFrame: 1 × 2

  NObs  NSamp
  5960   1788

$OutputCasTables
A casDataFrame: 1 × 5

  casLib              Name  Label  Rows  Columns
  CASUSER(sasdemo05)  hmeq         5960       26
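As a quick sanity check on the SRSFreq output above (this arithmetic is mine, not part of the action output): a 30% simple random sample of 5,960 rows should flag 1,788 rows with _PartInd_ = 1, leaving 4,172 for training.

```r
# 30% of the 5,960 rows are flagged as the sample (_PartInd_ = 1)
n_total <- 5960
n_samp  <- round(n_total * 0.30)
n_samp            # 1788, matching NSamp in SRSFreq
n_total - n_samp  # 4172 rows available for training
```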

Finally, let’s map our variables to build reusable labels for our function calls.


In [12]:
#Note: I do not want to hard code any of my variable names.
indata <- 'hmeq'

# Get variable info and types
colinfo <- head(cas.table.columnInfo(conn, table = indata)$ColumnInfo, -1)

# My target variable is the first column
target <- colinfo$Column[1]

# For models that can inherently handle missing values (ex: Decision Tree)
inputs <- colinfo$Column[-1]
nominals <- c(target, subset(colinfo, Type == 'varchar')$Column)

# For models that can't handle missing values
imp_inputs =  c("IMP_CLAGE", "IMP_CLNO", "IMP_DEBTINC", "IMP_DELINQ",  
                   "IMP_DEROG", "IMP_LOAN", "IMP_MORTDUE", "IMP_NINQ", "IMP_VALUE", 
                   "IMP_YOJ", "IMP_JOB", "IMP_REASON")
imp_nominals = c("IMP_JOB", "IMP_REASON")

Build and Score R Models

To run R models, the data must be taken from the in-memory CAS table and placed into an R data frame.


In [13]:
# Connect to in-memory CAS table
hmeq1 <- defCasTable(conn, tablename="HMEQ")

In [14]:
# Create CAS DataFrame from CAS table
df1 = to.casDataFrame(hmeq1)

In [15]:
# Create R DataFrame from CAS DataFrame
df1 = to.data.frame(df1)

In [16]:
# Rename partition indicator variable 
names(df1)[length(names(df1))]<-"part" 
# Make dummy variables 
df1$reason_debtcon <- ifelse(df1$REASON == 'DebtCon', 1, 0)
df1$job_Office <- ifelse(df1$JOB == "Office", 1, 0)
df1$job_Mgr <- ifelse(df1$JOB == 'Mgr', 1,  0)
df1$job_ProfExe <- ifelse(df1$JOB == 'ProfExe', 1, 0)

In [17]:
# Keep only imputed numeric data and target
df2 = subset(df1, select=c(BAD, IMP_CLAGE, IMP_CLNO, IMP_DEBTINC, IMP_DELINQ, 
                           IMP_DEROG, IMP_LOAN, IMP_MORTDUE, IMP_NINQ, IMP_VALUE, 
                           IMP_YOJ, part, reason_debtcon, job_Office, job_Mgr, job_ProfExe))
# Split into train and test data frames 
train = subset(df2, part==0)
train = subset(train, select = -c(part))
test = subset(df2, part==1)
test = subset(test, select = -c(part))
# Save actual values for test to use in assessment
actual = test$BAD

Great, now let’s begin modeling. I am looking at three models: a logistic regression, a decision tree, and a gradient boosting model. For each model, I will use an R function to build the model and score the test data set, but I will use SAS’s assessment function to make it easy to assess all models side by side.

R Logistic Regression


In [18]:
# Build logistic regression
rlog <- glm(BAD ~ ., family="binomial", data=train)

In [19]:
# Score test data 
rlog_scored <- predict(rlog, test, type="response")
# Create dataframe holding predicted values and results
rlog_scored  <- cbind(rlog_scored , actual)
rlog_scored  <- as.data.frame(rlog_scored)
# Save R dataframe to CAS table
rlog_scored <- as.casTable(conn, rlog_scored, casOut='rlog_scored')

In [20]:
# Assess performance
rlog_assessed <- cas.percentile.assess(conn,
        table    = list(name = 'rlog_scored'),
        inputs = 'rlog_scored',
        response = 'actual',
        event    = '1')
rlog_roc <- rlog_assessed$ROCInfo
rlog_roc$Model <- "R Logistic Regression"
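The C statistic that cas.percentile.assess reports can be cross-checked locally. Below is a base R sketch of AUC via the rank-sum (Mann-Whitney) identity; the score and label vectors here are illustrative, not the notebook’s data.

```r
# AUC = P(score of a random positive > score of a random negative),
# computed from ranks via the Mann-Whitney identity.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfect ranking -> 1
auc(c(0.9, 0.2, 0.8, 0.3), c(1, 0, 0, 1))  # 3 of 4 pairs ordered -> 0.75
```

Applied to the predicted probabilities and the actual labels, this should agree with the C column in rlog_roc.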

R Decision Tree


In [21]:
# Build decision tree
rtree <- rpart(BAD ~ ., method="class", data=train)

In [22]:
# Score test data 
rtree_scored <- predict(rtree, test, type="prob")
# Create dataframe holding predicted values and results
rtree_scored  <- cbind(rtree_scored , actual)
rtree_scored  <- as.data.frame(rtree_scored)
names(rtree_scored) <- c("p_0", "p_1", "actual")
rtree_scored <- subset(rtree_scored, select = -c(p_0))
# Save R dataframe to CAS table
rtree_scored <- as.casTable(conn, rtree_scored, casOut='rtree_scored')

In [23]:
# Assess performance
rtree_assessed <- cas.percentile.assess(conn,
        table    = list(name = 'rtree_scored'),
        inputs = 'p_1',
        response = 'actual',
        event    = '1')
rtree_roc <- rtree_assessed$ROCInfo
rtree_roc$Model <- "R Decision Tree"

R Gradient Boosting


In [24]:
# Prepare data for the xgboost package
# Save the train target
# (the test target is already stored as actual)
labels <- train$BAD
mtrain <- subset(train, select = -c(BAD))
mtest <- subset(test, select = -c(BAD))
# Convert the train and test data frames to matrices
mtrain <- as.matrix(mtrain)
mtest <- as.matrix(mtest)
# Convert the train and test matrices to xgb.DMatrix objects
xgtrain <- xgb.DMatrix(data=mtrain, label=labels)
xgtest <- xgb.DMatrix(data=mtest, label=actual)

In [27]:
# Build gradient boosting
rboost <- xgboost(data = xgtrain, label = labels, objective = "binary:logistic", nrounds = 100)


[1]	train-error:0.104746 
[2]	train-error:0.103068 
[3]	train-error:0.096596 
[4]	train-error:0.093241 
[5]	train-error:0.086769 
...
[98]	train-error:0.000479 
[99]	train-error:0.000479 
[100]	train-error:0.000479 

In [28]:
# Score test data
rboost_scored <- predict(rboost, xgtest)
# Create dataframe holding predicted values and results
rboost_scored <- data.frame(rboost_scored, actual)
rboost_scored <- as.data.frame(rboost_scored)
# Save R dataframe to CAS table
rboost_scored <- as.casTable(conn, rboost_scored, casOut='rboost_scored')

In [29]:
# Assess performance
rboost_assessed <- cas.percentile.assess(conn,
        table    = list(name = 'rboost_scored'),
        inputs = 'rboost_scored',
        response = 'actual',
        event    = '1')
rboost_roc <- rboost_assessed$ROCInfo
rboost_roc$Model <- "R Gradient Boosting"

Build and Score CAS Models

Now we can do the same thing in CAS: build, score, and assess a logistic regression, a decision tree, and a gradient boosting model. In addition, CAS has the option to autotune the tree-based models, so I would like to include an example of that as well! Autotuning searches for the combination of hyperparameters that maximizes model accuracy.
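Conceptually, autotuning is a guided search over hyperparameter settings that keeps the configuration with the best validation score. A toy base R sketch of the idea follows; fit_score is a stand-in for training and validating a real model (autotune’s actual search is more sophisticated than an exhaustive grid).

```r
# Stand-in for "train a model with these settings and return a
# validation score"; its optimum is known to be at (4, 5).
fit_score <- function(maxdepth, minleaf) {
  -((maxdepth - 4)^2 + (minleaf - 5)^2)
}

# Exhaustive grid search over candidate settings
grid <- expand.grid(maxdepth = 2:6, minleaf = c(1, 5, 10))
grid$score <- mapply(fit_score, grid$maxdepth, grid$minleaf)

# Keep the best-scoring configuration
best <- grid[which.max(grid$score), ]
best[, c('maxdepth', 'minleaf')]   # maxdepth = 4, minleaf = 5
```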

CAS Logistic Regression


In [30]:
# Build CAS Logistic Regression
cas.regression.logistic(conn,
    table = list(name = indata),
    class = imp_nominals,
    model = list(
        depVar = target,
        effects = imp_inputs),
    selection=list(method="BACKWARD"),
    store=list(name='log_model', replace=TRUE), 
    output=list(casOut=list(name='log_score',replace=TRUE),  
                pred='pred', resChi='reschi', into='into',
                copyVars=list(target, '_PartInd_')), 
    partByVar= list(name = "_partind_", train = "0", validate = "1")
)


$ModelInfo
A casDataFrame: 5 × 3

  RowId        Description             Value
  DATA         Data Source             HMEQ
  RESPONSEVAR  Response Variable       BAD
  DIST         Distribution            Binary
  LINK         Link Function           Logit
  TECH         Optimization Technique  Newton-Raphson with Ridging

$NObs
A casDataFrame: 2 × 5

  RowId  Description                  Value  Training  Validation
  NREAD  Number of Observations Read   5960      4172        1788
  NUSED  Number of Observations Used   5960      4172        1788

$ResponseProfile
A casDataFrame: 2 × 7

  OrderedValue  Outcome  BAD  Freq  Training  Validation  Modeled
             1        0    0  4771      3331        1440  *
             2        1    1  1189       841         348

$ClassInfo
A casDataFrame: 2 × 3

  Class       Levels  Values
  IMP_JOB          6  Mgr Office Other ProfExe Sales Self
  IMP_REASON       2  DebtCon HomeImp

$SelectionInfo
A casDataFrame: 5 × 4

  RowId          Description                Value     NValue
  METHOD         Selection Method           Backward     NaN
  SELCRITERION   Select Criterion           SBC          NaN
  STOPCRITERION  Stop Criterion             SBC          NaN
  HIERARCHY      Effect Hierarchy Enforced  Single       NaN
  STOPHORIZON    Stop Horizon               3              3

$Summary.ConvergenceStatus
A casDataFrame: 1 × 3

  Reason                                         Status  MaxGradient
  Convergence criterion (FCONV=1E-7) satisfied.       0  0.0009467803

$Summary.SelectionSummary
A casDataFrame: 7 × 6

  Control  Step  EffectRemoved  nEffectsIn       SBC  OptSBC
              0                         13  3363.055       0
        -     1  IMP_YOJ                12  3357.863       0
              2  IMP_JOB                11  3352.974       0
              3  IMP_MORTDUE            10  3349.044       1
              4  IMP_REASON              9  3349.604       0
              5  IMP_CLNO                8  3354.464       0
              6  IMP_LOAN                7  3372.209       0

$Summary.StopReason
A casDataFrame: 1 × 2

  Reason                                                       Code
  Selection stopped at a local minimum of the STOP criterion.     6

$Summary.SelectionReason
A casDataFrame: 1 × 1

  Reason
  The model at step 3 is selected.

$Summary.SelectedEffects
A casDataFrame: 1 × 2

  Label              Effects
  Selected Effects:  Intercept IMP_CLAGE IMP_CLNO IMP_DEBTINC IMP_DELINQ IMP_DEROG IMP_LOAN IMP_NINQ IMP_VALUE IMP_REASON

$SelectedModel.Dimensions
A casDataFrame: 5 × 3

  RowId        Description                 Value
  NDESIGNCOLS  Columns in Design              11
  NEFFECTS     Number of Effects              10
  MAXEFCOLS    Max Effect Columns              2
  DESIGNRANK   Rank of Design                 10
  OPTPARM      Parameters in Optimization     10

$SelectedModel.GlobalTest
A casDataFrame: 1 × 4

  Test              DF     ChiSq      ProbChiSq
  Likelihood Ratio   9  927.8438  6.187011e-194

$SelectedModel.FitStatistics
A casDataFrame: 11 × 4

  RowId     Description                    Training    Validation
  M2LL      -2 Log Likelihood          3265.7480145  1404.7441937
  AIC       AIC (smaller is better)    3285.7480145  1424.7441937
  AICC      AICC (smaller is better)   3285.8008864  1424.8679978
  SBC       SBC (smaller is better)    3349.1095227  1479.6327232
  ASE       Average Square Error          0.1208238     0.1208004
  M2LLNULL  -2 Log L (Intercept-only)  4193.5917796  1762.4978732
  RSQUARE   R-Square                      0.1994032     0.1813396
  ADJRSQ    Max-rescaled R-Square         0.3145064     0.2892936
  MCFADDEN  McFadden's R-Square           0.2212528     0.2029811
  MISCLASS  Misclassification Rate        0.1598754     0.1610738
  DIFFMEAN  Difference of Means           0.2505009     0.2394360

$SelectedModel.ParameterEstimates
A casDataFrame: 11 × 9

  Effect       IMP_REASON  Parameter           ParmName            DF       Estimate        StdErr       ChiSq     ProbChiSq
  Intercept                Intercept           Intercept            1   2.621205e+00  2.886291e-01   82.474890  1.070211e-19
  IMP_CLAGE                IMP_CLAGE           IMP_CLAGE            1   5.607892e-03  6.503098e-04   74.363313  6.498554e-18
  IMP_CLNO                 IMP_CLNO            IMP_CLNO             1   1.606460e-02  4.998012e-03   10.331073  1.308091e-03
  IMP_DEBTINC              IMP_DEBTINC         IMP_DEBTINC          1  -6.249575e-02  7.415592e-03   71.024674  3.527853e-17
  IMP_DELINQ               IMP_DELINQ          IMP_DELINQ           1  -7.563151e-01  4.540623e-02  277.443533  2.708342e-62
  IMP_DEROG                IMP_DEROG           IMP_DEROG            1  -6.573445e-01  5.903230e-02  123.995732  8.442498e-29
  IMP_LOAN                 IMP_LOAN            IMP_LOAN             1   2.135606e-05  4.873157e-06   19.205309  1.173865e-05
  IMP_NINQ                 IMP_NINQ            IMP_NINQ             1  -1.649363e-01  2.431238e-02   46.023296  1.168551e-11
  IMP_VALUE                IMP_VALUE           IMP_VALUE            1  -1.258613e-06  9.087475e-07    1.918219  1.660531e-01
  IMP_REASON   DebtCon     IMP_REASON DebtCon  IMP_REASON_DebtCon   1   2.940699e-01  9.895972e-02    8.830477  2.962409e-03
  IMP_REASON   HomeImp     IMP_REASON HomeImp  IMP_REASON_HomeImp   0   0.000000e+00           NaN         NaN           NaN

$Timing
A casDataFrame: 9 × 4

  RowId           Task                          Time      RelTime
  SETUP           Setup and Parsing     0.0409197807  0.068626965
  LEVELIZATION    Levelization          0.0499210358  0.083723058
  INITIALIZATION  Model Initialization  0.0207328796  0.034771315
  SSCP            SSCP Computation      0.0463039875  0.077656871
  FITTING         Model Selection       0.3005890846  0.504120897
  OUTPUT          Creating Output Data  0.0226120949  0.037922966
  STORE           Model Storing         0.1156239510  0.193914060
  CLEANUP         Cleanup               0.0008149147  0.001366701
  TOTAL           Total                 0.5962638855  1.000000000

$OutputCasTables
A casDataFrame: 2 × 5

  casLib              Name       Label  Rows  Columns
  CASUSER(sasdemo05)  log_model            1        2
  CASUSER(sasdemo05)  log_score         5960        5

In [31]:
# Score Test Data: bring the scored table local and flip the predicted
# probability (the modeled event was BAD = 0, per ResponseProfile above,
# so pred1 = 1 - pred gives P(BAD = 1))
log_score1 <- defCasTable(conn, tablename="log_score")
log_score1 <- to.casDataFrame(log_score1)
log_score1 <- to.data.frame(log_score1)
log_score1$pred1 <- 1 - log_score1$pred
# Save the result back to CAS
log_sc1 <- as.casTable(conn, log_score1, casOut='log_sc1')

In [32]:
# Assess Performance
log_assessed <- cas.percentile.assess(conn,
                table = list(name = 'log_sc1', where = '_PartInd_ = 1'),
                inputs = 'pred1',
                response = target,
                event = '1')
log_roc <- log_assessed$ROCInfo
log_roc$Model <- "CAS Logistic Regression"

CAS Decision Tree


In [33]:
# Build CAS Decision Tree
cas.decisionTree.dtreeTrain(conn,
    table = list(name = indata, where = '_PartInd_ = 0'),
    target = target,
    inputs = inputs,
    nominals = nominals,
    varImp = TRUE,
    casOut = list(name = 'dt_model', replace = TRUE)
)


$ModelInfo
A casDataFrame: 11 × 2

  Descr                              Value
  Number of Tree Nodes            17.00000
  Max Number of Branches           2.00000
  Number of Levels                 6.00000
  Number of Leaves                 9.00000
  Number of Bins                  20.00000
  Minimum Size of Leaves           5.00000
  Maximum Size of Leaves        3139.00000
  Number of Variables             24.00000
  Confidence Level for Pruning     0.25000
  Number of Observations Used   4172.00000
  Misclassification Error (%)     13.85427

$DTreeVarImpInfo
A casDataFrame: 5 × 4

  Variable    Importance         Std  Count
  DEBTINC    462.3418025  169.414308      3
  DELINQ      47.5949825    0.000000      1
  CLNO        10.6693789    3.277202      2
  DEROG        9.2955776    0.000000      1
  IMP_VALUE    0.3393939    0.000000      1

$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name      Rows  Columns
  CASUSER(sasdemo05)  dt_model    17       27

In [34]:
# Score Test Data 
cas.decisionTree.dtreeScore(conn,
    modelTable=list(name='dt_model'),
    table=list(name=indata),
    copyVars= list(target, '_PartInd_'),  
    assessOneRow=TRUE, 
    casOut = list(name = 'dt_scored', replace = T)
)


$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name       Rows  Columns
  CASUSER(sasdemo05)  dt_scored  5960       16

$ScoreInfo
A casDataFrame: 3 × 2

  Descr                        Value
  Number of Observations Read  5960
  Number of Observations Used  5960
  Misclassification Error (%)  13.674496644

In [35]:
# Assess Performance 
dt_assessed <- cas.percentile.assess(conn,
                table = list(name = 'dt_scored', where = '_PartInd_ = 1'),
                inputs = '_DT_P_           1',
                response = target,
                event = '1')
dt_roc <- dt_assessed$ROCInfo
dt_roc$Model <- "CAS Decision Tree"

CAS Autotuned Decision Tree


In [36]:
# Find Best Decision Tree Configuration
tune_dt <- cas.autotune.tuneDecisionTree(conn, 
            trainOptions=list(table = list(name = indata, where = '_PartInd_ = 0'),
            target = target,
            inputs = inputs,
            nominals = nominals,
            varImp = TRUE,
            casOut = list(name = 'tune_dt_model', replace = TRUE)))

In [37]:
# Score Test Data 
cas.decisionTree.dtreeScore(conn,
    table=list(name = indata),
    modelTable='tune_dt_model',
    copyVars     = list(target, '_PartInd_'),      
    assessonerow = TRUE,
    casOut       = list(name = 'tune_dt_scored', replace = T)
)


$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name            Rows  Columns
  CASUSER(sasdemo05)  tune_dt_scored  5960       16

$ScoreInfo
A casDataFrame: 3 × 2

  Descr                        Value
  Number of Observations Read  5960
  Number of Observations Used  5960
  Misclassification Error (%)  10.822147651

In [38]:
# Assess Performance 
tune_dt_assessed <- cas.percentile.assess(conn,
                table = list(name = 'tune_dt_scored', where = '_PartInd_ = 1'),
                inputs = '_DT_P_           1',
                response = target,
                event = '1')
tune_dt_roc <- tune_dt_assessed$ROCInfo
tune_dt_roc$Model <- "CAS Autotuned Decision Tree"

CAS Gradient Boosting


In [39]:
# Gradient Boosting
cas.decisionTree.gbtreeTrain(conn,
    table = list(name = indata, where = '_PartInd_ = 0'),
    target = target,
    inputs = inputs,
    nominals = nominals,
    casOut = list(name = 'gbt_model', replace = TRUE)
)


$ModelInfo
A casDataFrame: 18 × 2

  Descr                              Value
  Number of Trees                     50.0
  Distribution                         2.0
  Learning Rate                        0.1
  Subsampling Rate                     0.5
  Number of Selected Variables (M)    24.0
  Number of Bins                      50.0
  Number of Variables                 24.0
  Max Number of Tree Nodes            31.0
  Min Number of Tree Nodes            21.0
  Max Number of Branches               2.0
  Min Number of Branches               2.0
  Max Number of Levels                 5.0
  Min Number of Levels                 5.0
  Max Number of Leaves                16.0
  Min Number of Leaves                11.0
  Maximum Size of Leaves            1895.0
  Minimum Size of Leaves               5.0
  Random Number Seed                   0.0

$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name       Rows  Columns
  CASUSER(sasdemo05)  gbt_model  1392       32

In [40]:
# Score Test Data
cas.decisionTree.gbtreeScore(conn,
    table=list(name = indata),
    modelTable='gbt_model',
    copyVars     = list(target, '_PartInd_'),      
    assessonerow = TRUE,
    casOut       = list(name = 'gbt_scored', replace = T)
)


$OutputCasTables
A casDataFrame: 1 × 4

  casLib              Name        Rows  Columns
  CASUSER(sasdemo05)  gbt_scored  5960        8

$ScoreInfo
A casDataFrame: 3 × 2

  Descr                        Value
  Number of Observations Read  5960
  Number of Observations Used  5960
  Misclassification Error (%)  8.4731543624

$ErrorMetricInfo
A casDataFrame: 50 × 8

  TreeID  Trees  NLeaves         MCR    LogLoss         ASE       RASE      MAXAE
       0      1       13  0.19949664  0.4583175  0.14542261  0.3813432  0.8162591
       1      2       28  0.19949664  0.4289321  0.13438190  0.3665814  0.8296554
       2      3       44  0.19949664  0.4050840  0.12522465  0.3538709  0.8428032
     ...    ...      ...         ...        ...         ...        ...        ...
      48     49      707  0.08540268  0.2161872  0.06188125  0.2487594  0.9836874
      49     50      721  0.08473154  0.2151943  0.06158102  0.2481552  0.9847375

In [41]:
# Assess Performance 
gbt_assessed <- cas.percentile.assess(conn,
                table = list(name = 'gbt_scored', where = '_PartInd_ = 1'),
                inputs = '_gbt_P_           1',
                response = target,
                event = '1')
gbt_roc <- gbt_assessed$ROCInfo
gbt_roc$Model <- "CAS Gradient Boosting"

Assess Models

We have built our seven models; now let’s see how they compare to each other.


In [42]:
roc.df <- data.frame()
# Add R Models 
roc.df <- rbind(roc.df, rlog_roc, rtree_roc, rboost_roc)
# Add CAS Models 
roc.df <- rbind(roc.df, log_roc, dt_roc, tune_dt_roc, gbt_roc)

Confusion Matrix


In [43]:
# Manipulate the dataframe
compare <- subset(roc.df, round(roc.df$CutOff, 2) == 0.5)
rownames(compare) <- NULL
cf <- compare[,c('Model','TP','FP','FN','TN')]
cf


A data.frame: 7 × 5

  Model                         TP   FP   FN    TN
  R Logistic Regression        112   51  236  1389
  R Decision Tree              223   92  125  1348
  R Gradient Boosting          231   40  117  1400
  CAS Logistic Regression      109   49  239  1391
  CAS Decision Tree            264  153   84  1287
  CAS Autotuned Decision Tree  227   90  121  1350
  CAS Gradient Boosting        221   47  127  1393
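The misclassification rates computed next follow directly from these counts: misclassification = (FP + FN) / N. A quick base R check using the R Gradient Boosting row above:

```r
# Misclassification rate from confusion-matrix counts
misclass <- function(tp, fp, fn, tn) (fp + fn) / (tp + fp + fn + tn)

# R Gradient Boosting: TP = 231, FP = 40, FN = 117, TN = 1400
misclass(231, 40, 117, 1400)   # ~0.0878 on the 1,788-row test set
```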

Misclassification


In [44]:
# Build a dataframe to compare the misclassification rates
compare$Misclassification <- 1 - compare$ACC
miss <- compare[order(compare$Misclassification), c('Model','Misclassification')]
rownames(miss) <- NULL
miss


A data.frame: 7 × 2

  Model                        Misclassification
  R Gradient Boosting                 0.08780761
  CAS Gradient Boosting               0.09731544
  CAS Autotuned Decision Tree         0.11800895
  R Decision Tree                     0.12136465
  CAS Decision Tree                   0.13255034
  R Logistic Regression               0.16051454
  CAS Logistic Regression             0.16107383

Notice the improvement in our autotuned decision tree above. It beats both our R decision tree and our default CAS decision tree.

ROC


In [45]:
# Add a new column to be used as the ROC curve label
roc.df$Models <- paste(roc.df$Model, round(roc.df$C, 3), sep = ' - ')

# Create the ROC curve
ggplot(data = roc.df[c('FPR', 'Sensitivity', 'Models')],
    aes(x = as.numeric(FPR), y = as.numeric(Sensitivity), colour = Models)) +
    geom_line() +
    labs(x = 'False Positive Rate', y = 'True Positive Rate')


End the session


In [46]:
# End the session
cas.session.endSession(conn)


We have gone through an example of using SAS Viya with R for analysis. We connected to our CAS server; imported, explored, and cleaned our data; built three models in R, three in CAS, and one autotuned model in CAS; and ended by examining our models’ misclassification rates and ROC curves. Ultimately, we learned how easy it is to leverage SAS and R together using SWAT, allowing programmers to utilize the power of SAS analytics from the language they are most comfortable in.